A Tutorial of Customer Churn Analysis & Prediction

Data Visualization, EDA, and Churn Prediction Using ML Algorithms for Music Streaming Service

  1. Introduction
  2. Data Preparation
  3. Exploratory Data Analysis (EDA): Churned Vs. Stayed
  4. Customer Churn Prediction
  5. Conclusion
  6. Reference


1. Introduction

Predicting churn rates is very challenging, which is why so many data scientists and analysts in customer-facing businesses struggle with it. Since user-interaction services like Spotify communicate with customers constantly, they generate large amounts of log data every day. Thus, in this project, I would like to show how to manipulate a large, realistic dataset with Spark, as well as how to build a prediction model with Spark MLlib. Let's dive in!

Photo by Filip Havlik on Unsplash


2. Data Preparation

First, you need to import a number of Spark libraries as below; then you can open a SparkSession instance and start wrangling the big data.


2.1. Load and Clean Missing Data

Load Data

In this project, we are going to use the website log data collected from a virtual music streaming company called "Sparkify". The original size of this dataset is 12GB, but we can start exploring the data with a subset of it -mini_sparkify_event_data.json.

Clean Missing Data

Now it's time to clean up some empty and invalid data. If you run the code below, you can see that there are some users with an empty-string userId, who are probably not regular Sparkify users, so we can discard those rows from our dataset. You can also see a lot of None values in the "song" and other columns, but we don't need to take care of them for now (we are going to aggregate this dataset per user, so the song and artist values are not critical).
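A sketch of that cleaning step on a tiny hypothetical log (only the userId and page columns are shown; the real df has all the columns listed in the next section):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# Toy stand-in for the loaded event log.
df = spark.createDataFrame(
    [("10", "NextSong"), ("", "Home"), ("26", "Thumbs Up"), ("", "Login")],
    ["userId", "page"],
)

# Drop log rows whose userId is the empty string (guests / logged-out visitors).
df_clean = df.filter(df.userId != "")
print(df_clean.count())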


2.2. Build the Website-Log Dataset (df_log)

Understand the Website Log data

The table below describes the columns of the Sparkify user transaction data. I also highlighted some important variables that are used in this analysis.

| Column | Type | Description | Comment |
| --- | --- | --- | --- |
| ts | long | the timestamp of the user log | |
| sessionId | long | an identifier for the current session | |
| auth | string | the authentication log | 'LoggedIn', 'LoggedOut', 'Guest', 'Cancelled' |
| itemInSession | long | the number of items in a single session | |
| method | string | HTTP request method | 'PUT', 'GET' |
| status | long | HTTP status | 307: Temporary Redirect, 404: Not Found, 200: OK |
| userAgent | string | the name of the HTTP user agent | |
| userId | string | the user ID | |
| gender | string | the user gender | 'F', 'M' |
| location | string | the user location | |
| firstName | string | the user first name | |
| lastName | string | the user last name | |
| registration | long | the timestamp when the user registered | |
| level | string | the level of user subscription | 'free', 'paid' |
| page | string | the page name visited by users | 'Home', 'Login', 'LogOut', 'Settings', 'Save Settings', 'about', 'NextSong', 'Thumbs Up', 'Thumbs Down', 'Add to Playlist', 'Add Friend', 'Roll Advert', 'Upgrade', 'Downgrade', 'help', 'Submit Downgrade', 'Cancel', 'Cancellation Confirmation' |
| artist | string | the name of the artist played by users | |
| song | string | the name of the songs played by users | |
| length | double | the length of a song in seconds | |

Define Churn

Customer churn occurs when an existing customer cancels the subscription. In this project, I set the churn status to 1 only when a user visits the page Cancellation Confirmation. Since this dataset covers only two months, someone who submitted a downgrade before October appears at the 'free' level but has not churned. Thus we need to analyze the churn rate for both free and paid users.

Build The Website-Log Dataframe (df_log) by Adding More Columns

From the cleaned dataset (df_clean), we can build the website-log dataframe (df_log) with our target variable churn and some additional columns as below.


2.3. Transform the Website Logs into the User-Log Data by Aggregation

In order to predict the churn status of users, the website-log data needs to be transformed to one row per user. First, we need to discard some columns that are not related to customer churn events, such as session logs and user names. Then we can group the data by userId; the remaining columns fall into two types: user information and user activities. The user information columns in our data (churn, gender, level, and locCity) must be identical across every row of a given user.

For user activity data, we need to aggregate the logging data to create some meaningful features. I listed the new columns that I added to the user-log dataset below. 


3. Exploratory Data Analysis (EDA): Churned Vs. Stayed

For the visualization, I used the Plotly library to make interactive plots. The main code is long, so I hid the cells below; if you want to check them, please expand the cells.


4. Customer Churn Prediction

Now it's time to build a model and predict whether a user is going to churn or not. With this information, the business can personalize its offers, for example by providing special promotions to at-risk users. For this, we need to engineer features first, then find a proper model, and finally we can deploy the model to the real business in the future.

4.1. Engineer Features

From the visualization, we can finally select our features for the prediction model by modifying the user-log dataset as below.


4.2. Build the Model Pipeline

Let’s start building the pipeline for the prediction model using the ML libraries in Spark. For better cross-validation, I combined all feature transformations and an estimator into one pipeline and fed it into the CrossValidator. There are three main parts of the pipeline.

  1. Feature Transformation: categorical variables will be transformed into one-hot encoded vectors by StringIndexer and OneHotEncoder. Then, the categorical vectors and numerical variables will be assembled into a dense feature vector using VectorAssembler.

  2. Feature Importance Selection: I built a custom FeatureSelector class to extract only important features using a tree-based estimator. This step is optional, so I didn’t use it for the Logistic Regression or LinearSVC models.

  3. Estimator: The final step is using ML algorithms to estimate the churn label for each user.


4.4. Tune Hyper Parameters

Finally, I ran cross-validation to tune the hyperparameters of the RandomForestClassifier. Since our dataset is very small, you can observe an almost perfect training score, indicating that the model is overfitted. Thus, I selected cross-validation parameter maps that make the model less complex than the default model used for the model selection above (numTrees=20). The result shows that the model with 10 trees and 16 max bins performs slightly better, but it does not fully overcome the overfitting problem. I assume this problem can be solved by adding more data.


5. Conclusion

In this project, we went through how to analyze website log data and build a churn prediction model step by step. Although we used a simulated website log, and real website logs would be messier and much larger, I hope this project gives you an overview and a tutorial you can start from.

Since the mini-dataset (123MB) is not enough for our model, I would like to expand this project to the full 12GB dataset using AWS in the future to overcome the overfitting problem.

Thanks for reading, and I'd love to connect with you anytime via my LinkedIn!


6. Reference